SUMMARY OF THE INVENTION 

The present invention provides a method and system for inputting Chinese 
characters into a computer. The invention improves the ease of use as well as 
efficiency of inputting Chinese characters over the prior art. Ease of use and 
efficiency are inherently conflicting goals in Chinese character input systems.

According to a first aspect of the invention, some of the 200+ components
(also called radicals in the literature) used to construct Chinese characters are
each assigned representation by one of the letters in the English alphabet. This set of
selected components is sufficient to construct any Chinese character of interest.

Each Chinese character of interest to the present invention is assigned an
"encoding", being a text string in the English language, with each letter of the
string corresponding to the Chinese character component as defined by the
present invention. This is standard practice in the prior art. In the prior art, the
input systems match a given text string against the set of encodings (the library)
letter for letter. An input string that matches one in the library selects the Chinese
character associated with that encoding. This technique requires the user to
accurately memorize the exact encoding assigned to every Chinese character, a
monumental task prone to error, confusion, and forgetting from disuse. The
present invention uses a novel technique in order to reduce the amount of
memorization required of the user. In addition to the set of predefined encodings
(the library), the present invention also defines two "equivalence" tables, a
"forward" equivalence table and a "backward" equivalence table. These tables
define, for each letter of the English alphabet, a set of strings which are to be
considered "equivalent" to that letter during a comparison operation. When
comparing an input text string against one from the library, the two strings are not
simply compared letter for letter. Instead, each letter in the input string is further
expanded into the set of predefined strings given by the forward equivalence
table. Thus, if the letter 'a' is defined in the forward equivalence table as
consisting of the set of strings {'be', 'def', 'hijk'}, then the input string "a" will
match library strings "a", "be", "def", and "hijk". This technique is applied to every
letter in an input string. Similarly, the backward equivalence table is applied to all
letters in strings defined in the library. Thus, if the letter 'a' is defined in the
backward equivalence table as equivalent to the set {"zy", "xwv", "utor"}, then a
library string "a" will match the input strings "zy", "xwv", and "utor". The forward
and backward equivalence tables are applied in every comparison. The net result
is a substantial reduction in the amount of memorization imposed on the user. An
example will more clearly illustrate this technique.

For example, the Chinese character Bf can be constructed by using the
components "Fj" and or the components "Fj", V ", and "F -V, or the
components "B", V -V and -* ^, or the components "— ", and u/ p". There is
no standard definition as to which composition is the "official" one. In the prior art,
the user must provide the exact set of components in the exact sequence as
defined by the designer in order to get a match. (Some methods define multiple
sequences that map to the same character, but that is only done for some
characters and still requires an exact match of one of the predefined equivalent
sequences.) This practically requires the user to memorize the exact encoding
for every Chinese character. In the present invention, an unlimited number of
variations are allowed in describing a character construction to the input method.
In the above example, any of the possible descriptions will result in identifying the
character. A more detailed explanation of how the matches occur follows.

" B" is itself a complete Chinese character, and also a commonly
occurring component used in constructing other characters. As a character, it is
composed of the components "□" and ", and as a component, it is mapped to
one of the 26 letters of the English alphabet, say 'a'. Similarly, u/ f " is also itself a
Chinese character but is not a component used commonly enough in the
construction of other characters to warrant assignment to representation by a
designated letter of the English alphabet. As a character, it is composed of the
components "J ", ", " [ ", ", and "— ". Suppose the components "J ", " | ", and
" are mapped to the alphabetic letters 'o', 'j', and 'h' respectively. Thus, the
character can be described by the encoding "ajhihh", although that is not the
only possible encoding, just the one selected by the designer. However, as
opposed to the prior art, the user is not required to provide this exact encoding in
order to identify the character B^. Instead, as the following table shows, the user
can provide any of a number of varying input strings based on what the user
perceives as the components of the character B^, which may or may not be the
same as what the input method designer has defined:
Input String    Definition    Result    Reason

                                        The forward equivalence table defines 'a' to be
                                        equivalent to 'jh'. Therefore, the second 'a' in the
                                        input string matches the 'jh' in the library encoding
                                        string, and the rest match letter for letter.

ohjhihh         ajhihh                  The backward equivalence table defines 'a' to be
                                        equivalent to 'oh'. Therefore, the 'oh' in the input
                                        string matches the 'a' in the library encoding string,
                                        and the rest match letter for letter.

ohaihh          ajhihh        match     Any combination of forward and backward equivalence
                                        table matching is allowed. Therefore, 'oh' matches
                                        'a', and then 'a' matches 'jh'.
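The comparison described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the invention: the function names `expansions` and `matches` are hypothetical, and the sketch assumes that two strings match when some forward expansion of the input string equals some backward expansion of the library encoding. The input "aaihh" is an assumed example, chosen to be consistent with the forward-equivalence reasoning given in the table.

```python
from itertools import product

def expansions(text, table):
    """Return every string obtained by keeping each letter of `text`
    or replacing it with one of its equivalents from `table`."""
    choices = [[ch, *table.get(ch, ())] for ch in text]
    return {"".join(parts) for parts in product(*choices)}

def matches(input_string, library_encoding, forward, backward):
    """True if a forward expansion of the input equals a backward
    expansion of the library encoding (assumed matching rule)."""
    return not expansions(input_string, forward).isdisjoint(
        expansions(library_encoding, backward))

# Equivalence tables taken from the example: 'a' ~ 'jh' (forward),
# 'a' ~ 'oh' (backward).
FORWARD = {"a": ["jh"]}
BACKWARD = {"a": ["oh"]}
LIBRARY_ENCODING = "ajhihh"

for candidate in ("ajhihh", "aaihh", "ohjhihh", "ohaihh", "xyz"):
    print(candidate, matches(candidate, LIBRARY_ENCODING, FORWARD, BACKWARD))
```

Enumerating every expansion grows exponentially with the length of the string, so a practical comparison would more likely walk both strings in step; the exhaustive form is used here only because it mirrors the description most directly.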



In a first aspect of the present method, a novel technique is used to
encode Chinese characters. In the prior art, each Chinese character is encoded
as a string of English letters. This string is then compared to user input in order to
find a match. In the present method, each Chinese character is not encoded as a
string of letters, but as a data graph. This is a technique described in the Finite
State Automata field of Computer Science.

According to the theories of Finite State Automata (FSA), a Chinese
character can be described by a Non-deterministic Finite State Automaton (NFA).
An NFA is a structure which has multiple representations in multiple parts of the
structure. A Chinese character is generally composed of other simpler Chinese
characters. These simpler characters are in turn composed of even simpler
characters, and so on, until finally reaching indivisible strokes. This type of
structure fits the definition of an NFA, and all the techniques developed for NFA
analysis can be applied. In the prior art, each Chinese character is reduced to a
string, resulting in a loss of the inherent hierarchical structure of the character.
Describing a Chinese character as an NFA preserves the inherent structure and
has useful benefits.

For example, the character " 3£" can be represented by the following
graph:

The interpretation of this graph is that the character can be described by
multiple sequences of components:

1. 3E - ffl
2. zri— 1
3. 3EB±
4. ^n— ±
5. 3ED — [ —
6. ^H+-
7. :nbrB±
8. —f — Qzr±
9. — | — □ — b
10. — f — H-i —



The multiple descriptions of this character are a result of the character's
inherent hierarchical structure, as depicted in the graph. Each distinct description
represents a unique path of traversal through the graph. Graphs like these are
typically used to describe NFAs.
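The relationship between the hierarchical graph and its multiple descriptions can be sketched in Python as follows. The graph below is entirely hypothetical (the names "CHAR", "top", "bottom", "mid", and "s1" to "s3" are placeholders and do not correspond to the character above), and `descriptions` is an illustrative helper rather than the method of the invention. The sketch assumes that any component may either be named directly or be replaced by the descriptions of its parts.

```python
from itertools import product

# Hypothetical decomposition graph. Each key lists its alternative
# decompositions; names that are not keys are indivisible strokes.
GRAPH = {
    "CHAR": [["top", "bottom"]],
    "top": [["s1", "s2"]],
    "bottom": [["s3", "mid"]],
    "mid": [["s1", "s3"]],
}

def descriptions(node, graph, is_root=True):
    """Return the set of component sequences (tuples) describing `node`."""
    results = set()
    if not is_root:
        # A component below the root can be named directly.
        results.add((node,))
    for parts in graph.get(node, []):
        # Combine every description of each part, in order.
        options = [descriptions(part, graph, False) for part in parts]
        for combo in product(*options):
            results.add(tuple(sym for seq in combo for sym in seq))
    return results

for sequence in sorted(descriptions("CHAR", GRAPH), key=len):
    print(" ".join(sequence))
```

In this sketch the shortest description uses the two highest-level components, while the longest uses only strokes, which corresponds to traversing the bottom level of the graph.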



The benefit to the user is a reduction in the amount of memorization
required in Chinese data entry. In every graph, the leaf nodes are one of a few
fundamental strokes. Therefore, a beginning user only needs to memorize these
few fundamental strokes and can enter any character using just these strokes, in
essence traversing the bottom level of the graphs. As the user gains experience
and learns more high-level components, he will gradually ascend to higher-level
paths through the same graph, resulting in fewer components used in describing
the same character, thus increasing typing speed.

As mentioned earlier, once characters are represented as NFAs, the
techniques of Finite State Automata theory can be applied in processing the
NFAs. In particular, claim 1 describes a technique wherein a whole sub-branch
in a graph can be matched to a single user input symbol by equating the symbol
to a string of symbols which are the flattened contents of the sub-branch. This is a
technique commonly known as reducing an NFA to a DFA (Deterministic Finite
State Automaton).






In a second aspect of the present method, a "partial match" algorithm is
used to further increase the intelligence of the encoding comparison operation, in
addition to allowing one or more "wildcard" characters in a given sequence to
match one or more unspecified substrings of letters in an encoding. Whereas
explicit "wildcard" letters could be supplied by the user in an encoding, an
"implicit" wildcard is automatically created by the present invention
whenever a given input sequence does not yield any matches. Thus, supposing
'*' is a wildcard character, the input sequence "*jhihh" will match the encoding for
Efp, but "aihh" will also match it. This aspect of the present invention automatically
skips over non-matching text runs within an input string while continuing to
perform comparisons for matching runs, resulting in a comparison process that
accepts partially matching input sequences.
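The two kinds of wildcard can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the algorithm of the invention: the names `wildcard_match`, `implicit_match`, and `lookup`, and the library keys "CHAR_1" and "CHAR_2", are hypothetical. The explicit wildcard is taken to match one or more letters, and the implicit wildcard is assumed to mean that the letters of the input must appear in the encoding in the same order, with gaps allowed.

```python
import re

def wildcard_match(pattern, encoding):
    """Explicit wildcard: each '*' stands for one or more letters."""
    regex = ".+".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, encoding) is not None

def implicit_match(text, encoding):
    """Assumed implicit wildcard: the letters of `text` occur in
    `encoding` in order, with arbitrary runs skipped in between."""
    remaining = iter(encoding)
    return all(ch in remaining for ch in text)

def lookup(text, library):
    """Try an exact or explicit-wildcard match first; fall back to the
    implicit wildcard only when nothing matches."""
    found = [name for name, enc in library.items()
             if wildcard_match(text, enc)]
    if found:
        return found
    return [name for name, enc in library.items()
            if implicit_match(text, enc)]

# Hypothetical library: placeholder names mapped to encodings.
LIBRARY = {"CHAR_1": "ajhihh", "CHAR_2": "bcd"}

print(lookup("*jhihh", LIBRARY))
print(lookup("aihh", LIBRARY))
```

Applying the implicit wildcard only as a fallback keeps exact matches from being diluted by looser candidates, which is consistent with the statement that it is created only when an input sequence yields no matches.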


In a third aspect of the present method, a novel way of resolving conflicts
among characters having the same encodings is devised. Occasionally, more
than one Chinese character is composed of exactly the same components, the
construction differing only in the relative placement of the components. To
resolve these ambiguous encodings, an additional letter with a prescribed
semantic of positional description is appended to each conflicting encoding. Fig.
2 contains an example illustrating this novel technique.

In a fourth aspect of the present method, a novel way of selecting
characters matched by the input method is devised. Whenever more than one
candidate character matches a user-given letter sequence, the candidates are
presented to the user for manual selection. In the prior art, a number is
sometimes used as a means of specifying the user's choice. While a number is
obvious in its meaning, since a linear list of candidates is offered up for
selection, the present invention chooses to use an alphabetic letter instead.
Thus, the letter 'a' signifies choosing the first candidate, 'b' the second, and so
forth. The use of an alphabetic letter instead of a number is non-obvious and has
never been done in the prior art. It is not always possible for a given input
method, since the alphabetic letters are used for encoding Chinese characters
and may confuse the system if also used as candidate selection keys. This
aspect of the present invention is significant in that it allows the user to keep his
fingers on the basal touch-typing position (as opposed to having to move them
away to type a number), resulting in faster typing speed.
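The letter-based selection can be sketched in Python as follows. The function name `select_candidate` and the candidate labels are hypothetical, and the sketch assumes the system already knows that the next keystroke is a selection key rather than part of an encoding; how that distinction is made is not addressed here.

```python
import string

def select_candidate(candidates, key):
    """Return the candidate chosen by a lowercase letter key
    ('a' = first, 'b' = second, ...), or None if the key is invalid."""
    if len(key) != 1 or key not in string.ascii_lowercase:
        return None
    index = string.ascii_lowercase.index(key)
    return candidates[index] if index < len(candidates) else None

# Hypothetical candidate list presented to the user.
CANDIDATES = ["candidate_1", "candidate_2", "candidate_3"]

print(select_candidate(CANDIDATES, "a"))
print(select_candidate(CANDIDATES, "c"))
```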

In a fifth aspect of the present method, a novel way of attaching additional
information to an input string is devised. Since the present invention only
employs the 26 lowercase alphabetic letters in constructing input sequences,
characters outside of the employed set can be and are used as carriers of additional
information about the input sequence. For example, the input sequence "abc6-9"
is interpreted to mean 'match all characters defined by the encoding "abc" and
with a stroke count of 6 to 9'. As another example, any input sequence beginning
with an uppercase letter is defined to mean "pass through", which means the
given input sequence is made the output without interpretation, creating an
efficient way of entering English sentences in the midst of Chinese characters.
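The interpretation of such input sequences can be sketched in Python as follows. The names `ParsedInput` and `parse_input` are hypothetical, and only the two forms given in the examples above are handled: a lowercase encoding optionally followed by a stroke-count range written as "low-high", and a pass-through sequence beginning with an uppercase letter. Any other syntax is an open question that this sketch simply rejects.

```python
import re
from typing import NamedTuple, Optional, Tuple

class ParsedInput(NamedTuple):
    pass_through: bool                 # True: output the text unchanged
    encoding: str                      # lowercase encoding, or raw text
    strokes: Optional[Tuple[int, int]] # inclusive stroke-count range

# Lowercase letters, optionally followed by "low-high" digits.
_PATTERN = re.compile(r"([a-z]+)(?:(\d+)-(\d+))?")

def parse_input(text):
    """Split an input sequence into encoding and extra information."""
    if text[:1].isupper():
        return ParsedInput(True, text, None)
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized input sequence: {text!r}")
    low, high = match.group(2), match.group(3)
    strokes = (int(low), int(high)) if low is not None else None
    return ParsedInput(False, match.group(1), strokes)

print(parse_input("abc6-9"))
print(parse_input("Hello world"))
```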